HumanEval+

p-values

The null hypothesis is that models A and B each have a 1/2 chance of winning on any problem where their results differ; tied problems are not used. The p-value is the probability, under the null hypothesis, of a difference as extreme as the one observed. Hover over each entry to display the information used to compute its p-value.
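For concreteness, here is a minimal sketch of this sign test in Python, assuming each model's results are per-problem pass/fail booleans over the same problem set (the function name and inputs are illustrative):

```python
# Sign-test p-value for two models, dropping tied problems.
from scipy.stats import binomtest

def sign_test_p_value(results_a, results_b):
    """Two-sided p-value under the null that A and B are equally
    likely to win any problem on which they disagree."""
    wins_a = sum(a and not b for a, b in zip(results_a, results_b))
    wins_b = sum(b and not a for a, b in zip(results_a, results_b))
    n = wins_a + wins_b  # ties (both pass or both fail) are not used
    if n == 0:
        return 1.0  # the models never disagree
    return binomtest(wins_a, n, p=0.5, alternative="two-sided").pvalue
```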

Typical delta needed for good p-values

We can also find the typical p-value for a typical difference in accuracy. Hover over a point to display the actual model pairs it represents.
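To build intuition for this relationship, one can back out the sign-test p-value implied by a given accuracy gap, under an assumed problem count and an assumed fraction of tied problems (both parameters below are illustrative assumptions, not values read off the plot):

```python
from scipy.stats import binomtest

def p_value_for_delta(n_problems, acc_a, acc_b, tie_fraction=0.5):
    """Sign-test p-value implied by an accuracy gap, assuming the gap
    is carried entirely by problems where the two models disagree."""
    n_disagree = round(n_problems * (1 - tie_fraction))
    # count_a - count_b = wins_a - wins_b = n_problems * (acc_a - acc_b)
    gap = round(n_problems * (acc_a - acc_b))
    wins_a = (n_disagree + gap) // 2
    return binomtest(wins_a, n_disagree).pvalue

# e.g. a 5-point gap on the 164 HumanEval+ problems, half of them ties
print(p_value_for_delta(164, 0.70, 0.65))
```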

Pairwise wins (including ties)

Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
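A sketch of how these tallies can be computed, again from per-problem pass/fail booleans; reading "both pass" and "both fail" as the two tie types is our interpretation of the setup:

```python
from itertools import combinations

def pairwise_record(results):
    """results: {model_name: [bool, ...]} over a shared problem set.
    Returns (wins_a, wins_b, tie_both_pass, tie_both_fail) per pair."""
    records = {}
    for a, b in combinations(results, 2):
        wins_a = wins_b = tie_pass = tie_fail = 0
        for ra, rb in zip(results[a], results[b]):
            if ra and not rb:
                wins_a += 1
            elif rb and not ra:
                wins_b += 1
            elif ra and rb:
                tie_pass += 1
            else:
                tie_fail += 1
        records[(a, b)] = (wins_a, wins_b, tie_pass, tie_fail)
    return records
```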

Result table

We show three methods currently used for evaluating code models: raw accuracy as reported by benchmarks, average win rate over all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena). These usually have near-perfect correlation. A sketch of how each column can be computed follows the table.

rank  model  pass@1  win_rate  Elo
0 opencodeinterpreter-ds-33b 0.738 0.777 1248.264
1 meta-llama-3-70b-instruct 0.720 0.754 1226.816
2 mixtral-8x22b-instruct-v0.1 0.720 0.743 1213.711
3 HuggingFaceH4--starchat2-15b-v0.1 0.713 0.743 1214.953
4 deepseek-coder-7b-instruct-v1.5 0.713 0.742 1213.325
5 opencodeinterpreter-ds-6.7b 0.701 0.715 1186.565
6 xwincoder-34b 0.695 0.706 1177.236
7 speechless-coder-ds-6.7b 0.659 0.653 1131.674
8 code-llama-70b-instruct 0.659 0.647 1125.044
9 white-rabbit-neo-33b-v1 0.659 0.646 1124.281
10 speechless-starcoder2-15b 0.628 0.597 1084.457
11 bigcode--starcoder2-15b-instruct-v0.1 0.604 0.557 1048.541
12 microsoft--Phi-3-mini-4k-instruct 0.591 0.548 1042.756
13 Qwen--Qwen1.5-72B-Chat 0.591 0.539 1035.047
14 code-13b 0.524 0.442 955.455
15 speechless-starcoder2-7b 0.518 0.427 942.880
16 codegemma-7b-it 0.518 0.420 938.507
17 speechless-coding-7b-16k-tora 0.506 0.409 927.939
18 code-33b 0.494 0.392 912.610
19 open-hermes-2.5-code-290k-13b 0.488 0.382 903.425
20 starcoder2-15b-oci 0.433 0.307 836.395
21 codegemma-7b 0.415 0.306 834.815
22 mixtral-8x7b-instruct 0.396 0.266 796.939
23 mistralai--Mistral-7B-Instruct-v0.2 0.360 0.224 753.146
24 gemma-1.1-7b-it 0.354 0.203 727.874
25 octocoder 0.329 0.195 721.395
26 python-code-13b 0.305 0.160 675.949
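Below is a minimal sketch of how the three columns can be computed from per-problem pass/fail results. The tie handling (half a win to each model) and the Elo scaling (base-10 logits, scale 400, centered at 1000) are assumptions chosen to mimic Bradley-Terry-style ratings, not necessarily the exact pipeline used here:

```python
import numpy as np

def leaderboard(results):
    """results: {model_name: np.ndarray of bool} over a shared problem set.
    Returns names with pass@1, average win rate, and Elo-scaled
    Bradley-Terry coefficients."""
    names = list(results)
    X = np.array([results[m].astype(float) for m in names])
    n_models, n_problems = X.shape
    pass1 = X.mean(axis=1)

    # W[i, j] = wins of model i over model j; ties count half for each.
    diff = X[:, None, :] - X[None, :, :]
    W = (diff > 0).sum(axis=2) + 0.5 * (diff == 0).sum(axis=2)
    np.fill_diagonal(W, 0.0)
    win_rate = W.sum(axis=1) / (n_problems * (n_models - 1))

    # Bradley-Terry strengths via minorization-maximization (Hunter 2004).
    p = np.ones(n_models)
    games = W + W.T
    for _ in range(1000):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p_new = W.sum(axis=1) / denom
        p_new /= p_new.mean()
        if np.allclose(p, p_new, atol=1e-12):
            break
        p = p_new
    elo = 400 * np.log10(p)
    elo += 1000 - elo.mean()  # center the rating scale at 1000
    return names, pass1, win_rate, elo
```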